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Abstract 

We  present  a  framework  for  estimating  3D  relative  struc¬ 
ture  (shape)  and  motion  given  objects  undergoing  nonrigid 
deformation  as  observed  from  a  fixed  camera ,  under  per¬ 
spective  projection.  Deforming  surfaces  are  approximated 
as  piece-wise  planar,  and  piece-wise  rigid .  Robust  reg¬ 
istration  methods  allow  tracking  of  corresponding  image 
patches  from  view .  to  view  and  recovery  of  3D  shape  de¬ 
spite  occlusions,  discontinuities,  and  varying  illumination 
conditions.  Many  relatively  small  planar/rigid  image  patch 
trackers  are  scattered  throughout  the  image;  resulting  es¬ 
timates  of  structure  and  motion  at  each  patch  are  com¬ 
bined  over  local  neighborhoods  via  an  oriented  particle 
systems  formulation.  Preliminary  experiments  have  been 
conducted  on  real  image  sequences  of  deforming  objects 
and  on  synthetic  sequences  where  ground  truth  is  known. 

1  Introduction 

Estimation  of  3D  structure  (shape)  and  motion  from  2D 
image  sequences  has  been  a  central  problem  in  computer 
vision  for  many  years.  Many  early  studies  focused  on 
methods  of  relating  pixel  coordinates  to  3D  coordinates  via 
camera  calibration  [51,  54],  that  is  computing  the  projec¬ 
tion  matrix  which  relates  image  coordinates  to  a  world  co¬ 
ordinate  frame.  In  recent  years,  the  focus  has  shifted  to 
non-metric  reconstruction  from  uncalibrated  cameras  [25], 
by  computing  the  fundamental  matrix  (two  views)  [28], 
and  the  trilinear  tensor  (three  views)  [42].  Also,  different 
camera  models  were  assumed;  i.e.,  orthographic  [49,  53], 
perspective  projection  [27,  54],  or  a  unified  model  [4, 41], 

Determining  the  geometric  relationship  between  various 
views  of  the  environment  and  its  3D  structure  is  a  key  com¬ 
ponent  in  a  myriad  of  practical  applications:  reverse  en¬ 
gineering,  virtual  reality,  visualization,  surgical  planning, 
movie  special  effects,  computer  aided  design,  non-tactile 
inspection,  manufacturing,  image  compression,  etc.  When 
3D  shape  and  motion  estimates  are  computed  in  real  time, 
they  can  be  used  to  support  applications  where  a  computer 
(or  robot)  must  interact  with  its  environment:  manipula¬ 
tion,  navigation  and  control,  tracking,  etc .  Furthermore, 
such  estimates  can  be  utilized  to  determine  the  locations, 
postures,  and  configurations  of  humans  in  order  to  enable 
a  computer  to  assist  (or  avoid  hampering)  in  a  task. 

Despite  the  many  exciting  applications  and  the  energetic 
progress  of  research  in  structure  and  motion  recovery  al¬ 
gorithms,  many  problems  remain  unsolved.  Some  of  these 
issues  are  related  to  numerical  stability  and/or  ambiguity 
of  the  solution  under  general  conditions  [2,  33,  43,  45,  54, 
55].  Other  problems  stem  from  the  rich  variety  of  shapes 


and  motions  that  are  possible  in  the  world.  In  particular, 
many  shapes  can  be  non-planar  and/or  their  motion  can  be 
nonrigid.  Unfortunately,  all  of  the  above-mentioned  ap¬ 
proaches  assume  that  object  points  in  3D  space  must  re¬ 
main  at  fixed  distances  from  each  other  during  motion. 

Our  goal  is  to  extend  these  approaches  to  non-rigid  ob¬ 
jects.  We  propose  a  method  for  recovering  3D  shape  and 
motion  estimates  for  objects  undergoing  nonrigid  deforma¬ 
tion  as  observed  from  a  fixed  camera,  under  perspective 
projection.1  A  natural  first  step  to  take  towards  solving  this 
problem  is  to  assume  that  the  deforming  object  consists  of 
small  patches  that  are  rigid  and  planar  when  considering 
small  enough  regions.  In  other  words,  we  will  employ  a 
representation  where  deforming  surfaces  will  be  approxi¬ 
mated  as  piece- wise  planar,  and  piece- wise  rigid. 

A  second  assumption  common  to  many  of  these  ap¬ 
proaches  is  that  correspondence  between  features  in  dif¬ 
ferent  views  is  given.  As  will  be  outlined  later,  we  utilize  a 
tracker  that  automatically  registers  moving  image  patches 
from  frame  to  frame  [38].  Each  corresponding  warped  im¬ 
age  patch  is  then  used  directly  in  estimating  the  3D  orien¬ 
tation  of  the  piece- wise  planar  surface  patch,  and  its  3D 
position  up  to  a  scale  factor.  A  robust  image  registration 
formulation  provides  stability  to  shadows,  highlights,  and 
partial  occlusions.  Furthermore,  changes  in  illumination 
are  modeled  explicitly. 

Two  different  approaches  for  acquiring  piece-wise 
rigid/planar  models  are  possible:  top-down  and  bottom- 
up.  In  the  top-down  method,  the  initial  hypothesis  could 
be  that  an  object's  motion  can  be  adequately  modeled  as 
a  single  moving  rigid/planar  patch  [3];  the  model  would 
then  be  subdivided  and  augmented  as  needed  to  account 
for  non-planar/non-rigid  motion  via  an  adaptive  triangula¬ 
tion  procedure.  In  the  second,  bottom-up  approach,  many 
relatively  small  planar/rigid  image  patch  trackers  could 
be  scattered  throughout  the  image;  resulting  estimates  of 
structure  and  motion  at  each  patch  would  then  be  combined 
over  local  neighborhoods  via  an  extension  of  Szeliski' s  ori¬ 
ented  particle  systems  formulation  [16,  46], 

In  our  preliminary  system,  we  have  developed  the 
bottom-up  approach,  and  will  report  these  results.  The 
botton-up  framework  is  evaluated  using  synthetic  data  in 
which  ground  truth,  deformation,  and  noise  levels  are 
known.  The  method' s  efficacy  is  also  demonstrated  on  real 
image  sequences  of  deforming  objects.  Implementation  of 
the  top-down  approach,  and  experimental  comparison  of 
both  strategies,  is  saved  as  future  work 

llt  is  assumed  that  self  calibration  of  the  camera  will  be  given  or  ob¬ 
tained  via  a  standard  techniaue  (e.e..  rwn 
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2  Background 

The  many  years  of  work  in  structure  from  motion  have 
led  to  significant  advances  in  recovery  of  detailed,  texture 
mapped  models  and  motion  estimates  from  video  to  sup¬ 
port  graphics,  visualization,  and  compression.  A  number 
of  researchers  have  demonstrated  systems  that  can  recover 
planar  models  and  texture  maps  from  image  streams;  e.g., 
[4, 5, 12, 24]  to  name  a  few.  Other  researchers  have  demon¬ 
strated  methods  for  recovering  polygonal  models  of  an  ob¬ 
ject  that  is  positioned  on  a  rotating  platform  [1,  40,  44], 

Other  approaches  focus  on  the  problem  of  structure  from 
tracked  feature  points  (or  lines)  with  known  correspon¬ 
dence  from  two  or  more  frames,  under  orthographic  or 
perpsective  projection  [15,  54],  If  desired,  a  polygonal 
model  can  be  recovered  from  the  resulting  collection  of 
unorganized  3D  point  position  estimates  via  triangulation 
[8,  15, 20]  or  via  surface  approximation  [14,  26]. 

In  point  based  methods,  feature  tracking  and  correspon¬ 
dence  is  assumed.  Such  tracking  can  be  attained  via  any 
number  of  techniques.  Typically,  image  correlation  or  sum 
of  squared  differences  methods  are  used  [50].  A  point  fea¬ 
ture  is  essentially  a  small  image  patch,  which  is  tracked  by 
optimizing  some  matching  criterion  with  respect  to  transla¬ 
tion  or  affine  image  deformation.  Selection  of  good  points 
to  track  can  be  based  on  a  number  of  factors,  including  cor¬ 
ners,  texture,  sufficient  zero  crossings  in  the  Laplacian  of 
image  intensity,  etc.  [50].  Unfortunately,  even  a  “good” 
feature  can  be  difficult  to  track  if  it  lies  on  a  depth  discon¬ 
tinuity,  or  across  the  boundary  of  a  specular  highlight,  or  if 
it  is  occluded  during  tracking.  Such  problems  beg  the  use 
of  smaller  feature  windows,  since  smaller  windows  tend  to 
be  less  likely  to  straddle  discontinuities.  However,  there 
is  a  tradeoff:  estimates  based  on  smaller  windows  tend  to 
be  more  susceptible  to  noise  and  outliers,  since  there  are 
fewer  pixels  per  feature  window  tracked. 

Another  set  of  methods  is  based  on  image  registra¬ 
tion.  Take  for  example,  the  plane  plus  parallax  methods  of 
[3,  11,  22,  37].  These  methods  exploit  a  dominant  planar 
motion  to  compute  the  epipoles  and  perform  a  projective 
reconstruction.  Such  methods  can  use  robust  minimization 
methods  [6]  to  overcome  the  influence  of  outliers. 

All  of  the  methods  mentioned  so  far  assume  rigid  mo¬ 
tion  in  order  to  recover  a  model.  This  limits  the  utility  of 
the  above  methods  to  recovery  of  rigid  structure  and  mo¬ 
tion  estimates.  In  images,  the  deformational  motion  of  ob¬ 
jects  is  sometimes  due  to  changes  in  viewing  geometry.  In 
many  such  cases,  the  above  mentioned  methods  are  suffi¬ 
cient.  However,  in  general,  these  parameterizations  are  in¬ 
adequate  for  representing  motions  that  arise  due  to  a  gen¬ 
eral  nonrigid  deformation.  For  instance,  most  biological 
objects  are  flexible  and  articulated:  fingers  bend,  cheeks 
bulge,  fish  swim,  trees  sway  in  the  breeze,  etc .  Shapes  are 
stretched,  bent,  tapered,  dented,  etc.,  and  so  it  seems  logi¬ 
cal  to  employ  a  model  that  can  encode  the  ways  in  which 
real  objects  deform. 


This  rationale  led  to  the  development  of  3D  active  shape 
models  [48].  These  models  utilize  a  predefined  structure 
that  incorporates  prior  knowledge  about  a  shape’s  smooth¬ 
ness  and  its  resistance  to  deformation.  A  number  of  dif¬ 
ferent  3D  deformable  model  formulations  have  been  pro¬ 
posed;  e.g .,  deformable  tubes  [32,  48],  ellipsoidal  models 
[9,  30],  superquadrics  [31, 36],  etc .  Perhaps  the  major  lim¬ 
itation  of  such  methods  is  the  requirement  that  every  ob¬ 
ject  be  described  as  the  deformations  of  a  single  prototype 
object.  This  limits  the  kinds  of  shapes  (and  topologies) 
that  can  be  recovered  in  general,  since  we  can  only  recover 
shapes  that  are  achievable  via  the  specific  geometric  model 
and  nonrigid  motion  formulation. 

Some  researchers  attempt  to  overcome  this  limitation 
through  the  use  of  more  general,  3D  deformable  part  de¬ 
compositions  [35],  local  deformations  [30,  31,  34],  shape 
evolution  models  [13],  or  adaptive  subdivision  [21, 23, 52]. 
These  methods  offer  greater  generality,  but  are  still  some¬ 
what  limited  in  the  shapes  and  deformations  they  can  de¬ 
scribe  in  general.  Furthermore,  these  techniques  some¬ 
times  require  careful  initial  placement  of  the  model,  reli¬ 
able  feature  detection  for  model-image  correspondence,  or 
the  delicate  choice  of  model  parameters  (e.g.,  stiffness). 

A  second  assumption  common  to  many  of  the  above  ap¬ 
proaches  is  that  the  correspondence  between  features  in 
the  different  views  is  known.  To  get  around  this  problem, 
we  will  use  a  tracker  that  automatically  determines  corre¬ 
spondence  via  registration  of  image  patches  from  frame  to 
frame,  as  described  in  the  next  section. 

3  Tracking  Deforming  Image  Patches 

A  key  component  of  the  proposed  approach  is  tracking  vis¬ 
ible  parts  of  objects  from  frame  to  frame.  A  promising 
family  of  approaches  is  based  on  tracking  of  deforming  im¬ 
age  regions  [7,  17, 18,  38,  39].  These  approaches  integrate 
information  over  an  image  patch,  and  therefore  tend  to  be 
more  immune  to  noise  and/or  low-contrast,  especially  if  a 
robust  estimator  formulation  is  employed  [6].  Typically, 
use  of  a  robust  approach  requires  batch  processing,  though 
multiscale  techniques  offer  some  hope  for  realtime  perfor¬ 
mance.  Real-time  approaches  for  tracking  of  parameter¬ 
ized  patches  have  been  developed  [17,  18];  however,  they 
do  not  address  general  nonrigid  motion  tracking. 

3.1  Active  Blobs  Formulation 

More  general  nonrigid  motion  tracking  can  be  accom¬ 
plished  via  the  active  blobs  formulation  of  [38].  The 
formulation  provides  robustness  to  occlusions,  wrinkles, 
shadows,  and  specular  highlights.  Furthermore,  it  is  tai¬ 
lored  to  take  advantage  of  texture  mapping  hardware  avail¬ 
able  in  many  workstations,  PC's,  and  game  consoles.  This 
enables  nonrigid  tracking  at  speeds  approaching  video  rate. 

In  the  active  blobs  formulation,  shape  of  the  image  patch 
is  modeled  with  a  deformable  triangular  mesh.  The  con¬ 
struction  of  an  example  active  blob  model  is  shown  in 
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Figure  1 :  Construction  of  an  example  image  patch  model  via 
active  blobs .  From  left  to  right:  a.)  input  image  with  region  of 
interest  overlaid,  b.)  triangle  mesh  model,  c.)  texture  mapped 
model. 

Fig.  1.  Fig.  1(a)  shows  the  first  image  in  a  sequence  with 
regions  of  interest  outlined.  A  2D  active  triangular  image 
patch  model  is  then  constructed  for  the  region  of  interest  as 
shown  in  Fig.  1(b).  The  blob's  appearance  is  then  captured 
as  a  color  texture  map  and  applied  directly  to  the  triangu¬ 
lated  model  as  shown  in  Fig.  1(c). 

For  tracking,  the  active  blob  model  is  warped  such  that  it 
is  registered  with  the  incoming  image  sequence.  Warping 
is  defined  as  a  deformation  of  the  triangular  mesh  and  then 
a  bilinear  resampling  of  the  texture  mapped  triangles.  In 
essence,  texture  mapping  is  used  to  define  a  warping  func¬ 
tion  for  the  input  image,  I: 

I7  =  cW  (I,  u)  +  6  =  W(I,  a),  (1) 

where  u  is  a  vector  containing  deformation  parameters, 
and  b  and  c  model  brightness  and  contrast  variations. 
For  notational  convenience,  we  concatenate  the  parameters 
u,  b,  c  together  in  a  generic  parameter  vector  a,  and  define 
a  generic  warping  function  W.  In  our  current  system,  the 
photometric  correction  terms  are  defined  as  bilinear  func¬ 
tions  that  scale  the  red,  green,  and  blue  channels  equally. 

Perhaps  the  simplest  deformation  functions  to  be  used 
in  Eq.  1  are  those  of  an  eight  parameter  projective  model. 
Such  functions  are  suitable  for  approximating  the  rigid  mo¬ 
tion  of  a  planar  patch.  However,  since  the  piece- wise  pla¬ 
nar/rigid  assumption  is  likely  to  be  violated,  we  utilize  a 
parameterization  that  can  accommodate  greater  variability. 

A  more  general  parameterization  of  nonrigid  motion  can 
be  obtained  via  the  modal  representation  [34],  where  de¬ 
formation  is  represented  in  terms  of  eigenvectors  of  a  finite 
element  (FE)  model.  The  underlying  FE  formulation  offers 
the  added  advantage  that  it  can  be  used  in  obtaining  a  reg¬ 
ularized  solution  to  the  nonrigid  tracking  problem.  For  a 
given  modal  parameter  vector  obtained  in  tracking,  we  can 
compute  the  strain  energy  associated  with  deformation: 

m 

Ej strain  “  ^  ,  (2) 

i=i 

where  t oj  is  the  stiffness  associated  with  the  jth  modal  de¬ 
formation  parameter.  Note  that  these  stiffnesses  are  deter¬ 
mined  directly  from  the  FE  shape  model  [34,  38]. 

Recall  that  in  Eq.  1 ,  we  concatenate  the  deformation  and 
lighting  parameters  u,  b,  c  together  in  a  generic  parame¬ 


ter  vector  a.  Therefore,  generalized  stiffnesses  are  needed. 
We  define  a  diagonal,  generalized  stiffness  matrix  VP  that 
contains  the  modal  stiffnesses  toj  and  stiffnesses  for  the 
lighting  parameters  along  the  diagonal.  The  lighting  stiff¬ 
nesses  are  inversely  proportional  to  the  expected  variance 
in  lighting,  and  estimated  via  statistical  methods  [10,  38]. 

Tracking  is  then  posed  as  a  problem  of  regularized  active 
blob  registration.  For  each  frame,  the  image  template  is 
warped  to  minimize  a  regularized  registration  function: 

E  =  I  p(eu  <r)  +  7at$2a  (3) 

n  i= 1 

ei  =  ||I  (^ij  Vi)  ~~  yi)\\  (4) 

where  V{xi,yi)  is  a  pixel  in  the  warped  template  (Eq.  1), 
I(z* ,  yi )  is  the  pixel  at  the  same  location  in  the  input,  o  and 
7  are  scale  parameters,  and  p  is  an  influence  function  [19]. 

The  influence  function  p  is  also  known  as  a  robust  er¬ 
ror  norm  [6].  It  is  equivalent  to  the  incorporation  of  an 
analog  outlier  process  in  our  objective  function.  This  re¬ 
sults  in  better  robustness  to  specular  highlights  and  oc¬ 
clusions.  In  our  experiments,  we  have  used  the  function 
p(ei,a)  =  log(l  +  ef/(2<72))  [6,  38].  For  efficiency,  the 
log  function  can  be  implemented  via  table  look-up. 

3.2  Robust  Registration  Algorithm 

Registration  requires  minimization  of  residual  error  (Eq.  3) 
with  respect  to  the  deformation  and  lighting  parameters. 
A  common  approach  to  multi-dimensional  minimization 
problems  is  the  Marquardt-Levenberg  method.  Marquardt- 
Levenberg  requires  the  calculation  of  O(N)  gradient  im¬ 
ages  and  0(N2)  image  products  per  iteration  of  minimiza¬ 
tion,  where  N  is  the  number  of  model  parameters.  To  de¬ 
crease  the  number  of  gradient  calculations  needed,  we  can 
use  a  difference  decomposition  [17,  18,  38].  The  approach 
only  requires  the  equivalent  of  0(1)  image  gradient  calcu¬ 
lations  and  O(N)  image  products  per  iteration. 

In  the  difference  decomposition,  a  set  of  difference  im¬ 
ages  is  generated  by  adding  small  changes  to  each  of  the 
blob  parameters.  Each  difference  image  takes  the  form: 

bjb  =Io-W(Io,n*),  (5) 

where  Io  is  the  template  image,  and  n*  is  the  parameter 
displacement  vector  for  the  kth  difference  image,  bfc.  Each 
difference  image  becomes  a  column  in  the  matrix  B.  The 
difference  matrix  can  be  precomputed;  this  is  the  key  to  the 
difference  decomposition' s  speed. 

During  tracking,  an  incoming  image  I  is  inverse  warped 
into  the  blob’s  coordinate  system  using  the  most  recent  es¬ 
timate  of  the  warping  parameters  a.  The  difference  be¬ 
tween  the  inverse-warped  image  and  template  is  then  com¬ 
puted: 

D  =  I0-W-1(I,a).  (6) 

This  difference  image  D  can  be  approximated  in  terms  of  a 
linear  combination  of  the  difference  decomposition' s  vec¬ 
tors:  D  «  Bq,  where  q  is  a  vector  of  coefficients.  Thus, 
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Figure  2:  Tracking  of  a  patch  over  a  number  of  frames  in  a  video 
sequence.  The  patch  outlin  is  shown  in  white.  The  registration  of 
the  image  patch  from  frame  to  frame  implicitly  establishes  cor¬ 
respondence,  allowing  us  to  compute  a  least  squares  estimate  of 
the  local  surface  orientation  and  relative  depth.  The  recovered 
surface  normal  is  shown  displayed  over  top  the  input  sequence. 

the  maximum  likelihood  estimate  of  q  can  be  obtained  via 
least  squares: 

q  =  (B*B)_1B*D.  (7) 

The  change  in  the  image  warping  parameters  is  obtained 
via  matrix  multiplication 

Aa  =  Nq,  (8) 

where  N  has  columns  formed  by  the  parameter  displace¬ 
ment  vectors  n&  used  in  generating  the  difference  basis. 

A  robust  solution  can  be  obtained  through  inclusion  of  a 
diagonal  weighting  matrix  in  Eq.  7: 

q  =  (BtS-1B)-1B<S-1Dl  (9) 

where  entries  in  the  diagonal  matrix  S  take  the  form  su  = 
2a2  +  Df ,  as  derived  from  the  robust  error  norm  p. 

Finally,  the  formulation  can  be  extended  to  include  a 
regularizing  term  that  enforces  the  priors  on  the  model  pa¬ 
rameters.  This  is  accomplished  using  a  constrained  least 
squares  formulation: 

q  =  PD-Qa,  (10) 

where  P  =  [B*S-1B  +  7N‘fr2N]-1  B4S-1  and  Q  = 
7  [BtS-1B  +  7N^2N] -1  N4$2.  If  needed,  this  mini- 
mization  procedure  can  be  iterated  at  each  frame  until  the 
percentage  change  in  the  error  residual  is  below  a  thresh¬ 
old,  or  the  number  of  iterations  exceeds  some  maximum. 

An  example  of  tracking  and  image  patch  via  difference 
decomposition  is  shown  in  Fig.  2.  Image  warping  and  reg¬ 
istration  implicitly  establishes  correspondences  between 
views;  every  pixel  within  an  image  patch  now  has  a  cor¬ 
responding  location  in  the  next  frame.  Given  these  corre¬ 
sponding  pixel  locations,  we  can  recover  estimates  of  local 
planar  structure  and  surface  normal  via  least  squares  [54] 
as  described  in  the  next  section. 


4  Piece- Wise  Planar  Structure  Recovery 


For  a  given  collection  of  corresponding  image  points  in  two 
views,  we  estimate  the  planar  patch's  relative  position  and 
orientation  via  an  algorithm  proposed  by  Weng,  et  al  [54] 
and  similarly  presented  by  Faugeras  in  [15].  The  approach 
employs  a  linear  algorithm  that  yields  a  closed  form  solu¬ 
tion.  The  formulation  is  briefly  restated  here.  We  consider 
this  as  a  preliminary  formulation,  since  it  is  standard  in 
the  literature;  however,  we  plan  to  evaluate  other  methods 
for  planar  structure  recovery  in  future  work.  In  particu¬ 
lar,  multiple  frame  approaches  [5],  constrained  approaches 
[47],  and  more  stable  approaches  [43]  seem  promising. 

Weng,  et  al  [54]  use  an  ideal  pin  hole  camera  model 
with  unit  focal  length.  A  conventional  camera  can  be  cal¬ 
ibrated  so  that  every  point  in  the  actual  image  plane  can 
be  transformed  to  a  point  in  the  image  plane  of  this  nor¬ 
malized  model.  Consider  a  point  on  the  object  that  is  vis¬ 
ible  at  two  time  instants.  The  3D  spatial  position  of  the 
point  in  the  first  instant  is  denoted  x  —  (x,  y,  zf,  and  in 
the  second  x'  —  {x' ,y\  z')*.  The  image  coordinates  of 
the  point,  in  the  first  and  second  images  are  denoted  X  = 


(u,v,  1)*  =  (*,*,  1)*  and  X'  =  (u'y,  1)*  =  (f l)4, 
where  (u,  v)  and  vf)  are  the  image  coordinates  of  the 
point,  in  the  first  and  second  images  respectively.  There¬ 


fore,  the  spatial  vector  and  image  vector  are  related  by 
x  =  *X,x'  =  z'X'. 


The  basic  rigid  motion  equation  that  relates  spatial 
points  at  the  two  time  instances  is: 


x'  =  flx  +  T.  (11) 


where  R  and  T  are  a  rotation  matrix  and  translation  vec¬ 
tor  respectively.  It  is  assumed  that  the  camera  undergoes 
rotation  around  an  axis  going  through  the  origin  followed 
by  a  translation.  It  is  further  assumed  that  the  world  co¬ 
ordinate  system  is  centered  at  the  optical  center.  Note  that 
in  monocular  sequences,  the  translation  vector  T  and  the 
depths  of  the  object  points  z  and  z '  can  only  be  determined 
up  to  a  scale  factor.  Therefore  translation  is  described  in 
terms  of  a  unit  vector  and  depth  estimates  are  simi¬ 
larly  normalized  prjy. 

The  plane  where  the  points  are  located  in  3D  space  can 
be  represented 

N*x  =  1.  (12) 

where  N  is  the  plane's  normal  vector.  The  distance  d  be¬ 
tween  the  origin  and  the  plane  is  d  =  ||N||_1.  Note  that 
d^ 0  thus  excluding  cases  in  which  the  plane  goes  through 
the  origin.  Furthermore,  since  we  can  only  determine  depth 
up  to  a  scale  factor,  we  can  only  determine  the  normal  up 
to  a  scale  factor. 

From  Eqs.  1 1  and  12  we  get 

x'  =  (R  +  TN*)x.  (13) 


We  define  the  intermediate  parameter  matrix : 

F  =  R  +  TN4,  (14) 
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which  can  be  rewritten  in  terms  of  image  vectors: 


z'X'  =  FzX.  (15) 


Applying  a  cross  product  with  Xf  on  both  sides  of  the 
equation  yields: 

X'xFX  =  0.  (16) 

This  can  be  rewritten  in  terms  of  the  product  of  a  matrix 
with  a  vector  that  contains  the  elements  of  intermediate  pa- 
rameter  matrix  h  =  (/u,/i2,  /i3,/23,  •  •  -,f33)u- 


‘  X* 
0 

i/X* 


0 

X* 

— u'X* 


h  =  0. 


-u'X*  ‘ 
—u'X* 

0 


(17) 


The  third  row  is  a  linear  combination  of  the  other  two  and 
thus  can  be  omitted. 

If  we  stack  these  2  rows  n  times  in  a  matrix  where  n  is 
the  number  of  points  we  get  a  2n  x  9  matrix  such  that 


xi 

0 

-<X*  1 

0 

xi 

-«1X* 

X* 

0 

-<x* 

0 

X*n 

-<X*  J 

(IB) 


We  then  solve  for  unit  vector  h  =  min^  1 1  Ah ||,  subject 
to:  ||h||  =  1  If  rank(A)  =  8,  h  can  be  solved  up  to  a  scale 
factor.  Weng  et  al  [54]  show  that  rank(A)  =  8  if  and  only 
if  there  exists  a  set  of  four  object  points  such  that  no  image 
projections  of  any  three  points  in  this  set  are  collinear  in 
any  of  the  two  images.  Then  assuming  rank(A)  =  8  the 
solution  of  h  is  a  unit  eigenvector  of  A1  A  associated  with 
the  smallest  eigenvalue. 

Since  all  the  necessary  information  for  F  is  contained 
in  h  we  are  now  ready  to  solve  for  the  rotation,  transla¬ 
tion,  and  plane  normal  from  F.  There  are  four  cases  to 
consider  corresponding  to  the  multiplicity  of  FlF' s  eigen¬ 
values.  For  brevity,  these  details  are  omitted.  For  the  four 
cases  and  their  geometric  interpretation  see  [54]. 

5  Combining  Surface  Estimates 

The  strategy  is  to  scatter  many  relatively  small  planar/rigid 
image  patch  trackers  throughout  the  image.  Using  the  pro¬ 
cedure  described  above,  a  separate  3D  position  and  ori¬ 
entation  estimate  is  recovered  for  each  image  patch.  It  is 
possible  that  structure  estimates  will  be  noisy.  A  regular¬ 
ized  solution  can  be  obtained  by  combining  the  piece-wise 
shape  estimates  over  local  neighborhoods  via  an  extension 
of  Szeliski  and  Tonnesen’s  oriented  particle  systems  for¬ 
mulation  [16,  46].  Using  this  approach,  complex  surfaces 
are  modeled  as  sets  of  local  surface  elements  that  inter¬ 
act  with  each  other.  Interaction  potentials  are  devised  that 
cause  particles  to  move  into  locally  smooth  arrangements 
subject  to  external  forces  that  are  derived  from  the  image- 
based  piece-wise  structure  estimates. 


Unlike  the  particle  systems  commonly  used  in  computer 
graphics,  our  oriented  particle  system  is  massless.  Instead, 
the  formulation  utilizes  potentials  that  enforce  priors  on 
surface  bending.  This  difference  in  formulation  is  due  to 
the  particular  goal  of  our  application:  regularization  of 
the  piece-wise  planar/rigid  structure  estimates.  Following 
[16,  46],  we  define  a  co-normality  potential  </>$  and  co¬ 
planarity  potential  <j)fj  between  particles  i  and  j: 

4>ij  =  1  —  Hi  •  n,- ,  (19) 

4>fj  =  (ni-ry)2  +  (niTy)2,  (20) 

where  n*  and  n j  are  the  unit  normals  for  two  piece-wise 
planar  patches,  and  is  the  vector  connecting  the  two 
patch  centers.  These  two  terms  determine  the  surface’s  re¬ 
sistance  to  bending. 

In  the  simulation,  the  potentials  are  combined  in  an  in¬ 
ternal  energy  term  that  sums  the  inter-particle  energies: 

E internal  =  ^ +  0i(l>ij)P{rij) )  (21) 

hj 

where  a  is  a  scale  factor  that  controls  the  relative  im¬ 
portance  of  the  terms,  and  /3  is  a  monotonically  decreas¬ 
ing  function  used  to  limit  the  range  of  the  forces  and 
torques  derived  from  the  potential  energy  function.  For 
this  application,  the  function  we  use  is  /3(Uj)  =  max(l  - 
I  \rij  | \m/dm ,  0) ,  where  d  is  the  desired  falloff  distance,  and 
m  controls  the  rate  of  falloff. 

Due  to  this  falloff,  a  particle  is  affected  by  forces  and 
torques  exerted  by  the  other  particles  only  within  its  local 
neighborhood  A/*.  Equations  for  the  forces  and  torques  can 
be  found  in  [46].  For  numerical  conditioning,  a  damping 
term  is  added  to  both  force  and  torque  equations. 

To  gain  a  regularized  estimate  of  the  piece- wise  surface, 
we  run  a  particle  simulation.  We  define  two  sets  of  par¬ 
ticles  in  the  simulation:  surface  particles  and  data  parti¬ 
cles.  One  surface  particle  and  one  data  particle  are  defined 
for  each  piece-wise  planar  surface  estimated  in  the  image. 
The  initial  value  of  each  surface  and  data  particle  is  the  po¬ 
sition  and  orientation  estimated  via  tracking  as  described 
in  Sec.  4.  Data  particles  remain  fixed  during  the  simula¬ 
tion,  while  surface  particles  are  free  to  move.  Each  pair  of 
data  and  surface  particles  can  be  joined  by  a  linear  spring. 

The  particle  system’s  behavior  is  described  by  an  ordi¬ 
nary  differential  equation  [46],  and  integrated  in  time  via 
Euler's  method  until  the  change  in  the  potential  energy  be¬ 
tween  iterations  goes  beneath  a  threshold.  The  regularized 
piece-wise  surface  is  taken  as  the  position/orientation  of 
the  surface  particles  at  the  end  of  the  simulation. 

It  is  possible  that  there  are  depth  discontinuities  present 
in  the  scene,  and  therefore  particles  may  lie  on  different 
sides  of  a  depth  discontinuity.  The  forces  that  bind  parti¬ 
cles  should  therefore  be  modeled  as  springs  that  break  apart 
if  particles  are  too  far  out  of  alignment  [16]. 

The  advantage  of  using  the  oriented  particle  system  ap¬ 
proach  is  that  it  requires  no  a  priori  knowledge  of  the 
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piece- wise  surface's  topology.  One  disadvantage  is  that  the 
approach  requires  careful  parameter  setting.  Furthermore, 
the  computational  complexity  of  simulation  is  prohibitive 
for  large  particle  systems;  each  update  of  the  system  re¬ 
quires  the  calculation  of  0(n2)  inter-particle  forces.  The 
complexity  issue  can  be  addressed  through  the  use  of  spa¬ 
tial  data  structures  [46]. 

6  “Good”  Image  Patches  to  Track 

Piece- wise  structure  recovery  depends  on  the  registration 
of  deforming  image  patches  from  frame  to  frame.  In  our 
proposed  system,  the  strategy  is  to  track  many  patches  at 
a  time.  Some  patches  will  be  relatively  “good”  and  will 
allow  accurate  tracking  of  deformation.  Other  patches 
may  present  problems  in  deformable  region  tracking,  and 
should  be  detected. 

For  instance,  some  image  patches  may  have  relatively 
low  contrast  and  therefore  will  be  unfit  for  tracking.  More 
generally,  we  need  to  anticipate  and  deal  with  the  aperture 
problem  in  estimating  patch  motion.  At  each  pixel,  it  is 
only  possible  to  estimate  that  component  of  image  velocity 
that  is  orthongal  to  an  image  isobrightness  contour.  One 
solution  to  this  problem  is  to  calculate  motion  over  larger 
image  patches.  Since  we  are  tracking  relatively  large  image 
patches  (on  the  order  of  16  x  16  or  32  x  32  pixels),  it  is 
often  possible  to  resolve  the  aperture  problem,  assuming 
sufficient  image  contrast. 

However,  in  general,  there  will  still  be  some  image 
patches  for  which  it  is  impossible  to  reliably  estimate  the 
motion  parameters  due  to  the  aperture  problem.  In  certain 
cases,  parameter  estimates  may  be  ambiguous  or  undercon¬ 
strained.  This  is  a  generalization  of  the  aperture  problem 
[18].  It  effects  not  only  estimates  of  translational  motion, 
but  estimates  of  deformational  motion  as  well.  It  may  be 
possible  to  reliably  estimate  only  a  subset  of  deformation 
parameters  given  an  image  patch  of  a  particular  texture. 
This  ambiguity  can  be  detected  by  computing  the  rank  of 
the  matrix  B*B  employed  in  image  registration  (Sec.3.2). 
If  this  matrix  is  rank  deficient,  then  there  will  be  an  inher¬ 
ent  ambiguity  in  tracking  for  that  patch. 

More  generally,  B*B  serves  as  the  estimated  covariance 
matrix  of  the  standard  errors  in  the  recovered  registration 
parameters  for  each  patch.  These  covariances  could  be  in¬ 
corporated  directly  into  the  structure  recovery  and  in  the 
oriented  particle  simulation.  This  would  allow  resolution 
of  possible  ambiguities  by  pooling  over  neighborhoods, 
and  is  saved  as  future  work. 

Unfortunately,  even  a  “good”  image  patch  can  be  diffi¬ 
cult  to  track  if  it  lies  on  a  depth  discontinuity,  across  the 
boundary  of  a  specular  highlight,  or  if  it  is  occluded  dur¬ 
ing  tracking.  The  use  of  the  influence  function  formulation 
in  registration  provides  improved  robustness  to  these  ef¬ 
fects.  The  particular  robust  error  norm  employed  reaches 
its  theoretical  break  down  point  when  the  number  of  out¬ 
liers  exceeds  50%.  As  suggested  by  [50],  patches  that 


straddle  depth  discontinuities  can  be  detected  by  inspect¬ 
ing  the  residual  error  in  registration  at  each  step. 

7  Preliminary  Experiments 

To  test  the  capabilities  of  our  proposed  framework,  we  built 
an  experimental  implementation  of  the  piece- wise  planar 
tracking  system.  Our  system  was  implemented  on  an  SGI 
02.  with  a  180Mhz  R5K  processor,  128MB  RAM.  At  this 
time,  only  the  tracking  and  piece-wise  structure  modules 
have  been  fully-tested.  The  particle  system  module  has  un¬ 
dergone  preliminary  testing  with  planar  motion  sequences. 
Full  integration/evaluation  of  the  particle  system  module  is 
expected  for  the  final  version  of  this  paper. 

The  basic  piece- wise  structure  approach  was  tested  on 
synthetic  sequences  in  which  ground  truth  was  known. 
The  experimental  setup  for  generating  synthetic  sequences 
was  as  follows.  A  polygonal,  texture  mapped  model  was 
rendered  under  perspective  projection  using  OpenGL  at 
128  x  128  resolution.  The  resulting  image  sequence  was 
then  used  as  a  test  sequence.  For  visualization  purposes, 
the  recovered  normal  and  patch  location  were  then  dis¬ 
played  overlaid  on  the  input  frames.  Additional  ortho¬ 
graphic  views  were  displayed  for  ease  of  viewing. 

The  system  was  tested  on  approximately  twenty  syn¬ 
thetic  sequences  under  varying  amounts  of  rotation,  scal¬ 
ing,  translation,  and  deformation.  Two  different  3D  defor¬ 
mation  functions  were  used:  quadratic  bending,  and  heli¬ 
cal  twisting.  Illumination  was  kept  fixed,  since  previous 
experiments  with  active  blobs  [38]  already  demonstrated 
showed  robustness  of  the  tracker  to  illumination.  Each  im¬ 
age  region  tracked  was  32  x  32  pixels  in  size. 

Results  for  two  different  synthetic  sequences  are  shown 
in  Figs.  3  and  4.  In  both  figures,  the  first  frame  in  the  input 
sequence  is  shown  in  (a),  with  the  initial  position  of  im¬ 
age  patches  shown  overlaid  in  white.  Subsequent  frames 
in  the  sequence  are  shown  in  (b).  Ground  truth  normals  are 
shown  in  green.  Estimated  normals  are  shown  in  red.  To 
better  visualize  the  result,  orthographic  views  of  the  sur¬ 
face  an  normals  are  shown  below  each  image  in  the  se¬ 
quence  (c,d). 

Since  the  polygonal  model  and  the  deformation  were 
known,  ground  truth  structure  and  normal  information  was 
readily  available.  This  allowed  us  to  compute  error  in  ori¬ 
entation  estimates.  Throughout  the  synthetic  sequences 
tested,  the  dot  product  between  the  estimated  and  ground 
truth  normals  had  an  average  value  of  0.97  (15°). 

The  system  has  also  been  tested  on  real  image  sequences 
of  deformable  objects  in  motion.  Frames  taken  from  a 
tracking  sequence  of  piece  of  a  foam  rubber  block  deform¬ 
ing  are  shown  in  Fig.  5.  As  before,  tracked  regions  are 
shown  outlined  in  white  and  estimated  normals  are  shown 
in  red  (displayed  under  perspective  projection).  As  can  be 
seen,  the  results  look  reasonable  despite  the  large  defor¬ 
mation  and  nonrigidity.  We  expect  that  the  results  will  im¬ 
prove  further  with  inclusion  of  the  particle  system  module. 
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Figure  3:  Example  tracking  with  synthetic  sequence:  twisting.  A  perspective  image  sequence  was  generated  for  a  deforming  plane.  The 
first  frame  in  the  input  sequence  is  shown  in  (a),  with  the  initial  position  of  image  patches  shown  overlaid  in  white.  Frames  taken  from 
later  in  the  input  sequence  are  shown  in  (b).  Ground  truth  normals  are  shown  in  green.  Estimated  normals  are  shown  in  red.  To  better 
visualize  the  result,  corresponding  orthographic  side  views  are  shown  below  each  image  in  the  sequence  (c,d). 


Figure  4:  Second  example  of  tracking  with  synthetic  sequence:  quadratic  bending  of  planar  sheet.  A  perspective  image  sequence  was 
generated  and  piece-wise  model  estimates  were  obtained  as  in  previous  example.  The  first  frame  in  the  input  sequence  is  shown  in  (a), 
with  the  initial  position  of  image  patches  shown  overlaid  in  white.  Subsequent  frames  are  shown  in  (b).  As  before,  ground  truth  normals 
are  shown  in  green  and  estimated  normals  are  shown  in  red.  Corresponding  orthographic  top  views  are  shown  below  each  image(c,d). 


Figure  5:  Example  tracking  with  a  real  image  sequence:  a  foam  rubber  block  deforming.  As  before,  tracked  regions  are  shown  outlined 
in  white  and  estimated  normals  are  shown  in  red  (displayed  under  perspective  projection). 
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